CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.

OBJECTIVE: The company's management wants to invest in proposals for managing some of the best teams in the league. The analytics department has been tasked with creating a report on the teams' performance. Some of the older teams are already under contract with competitors, so Company X wants to understand which teams it can approach for a winning deal.

**Steps and tasks:**

  1. Read the data set, clean the data and prepare a final dataset to be used for analysis.

Type casting raises an error wherever non-numerical values are present.

As seen above, row index 60 contains "-" in several columns. Let's drop that row: the team has never played a game, so it gives no clue about its performance.
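A minimal sketch of how such rows could be detected and dropped before casting. The toy values and the column subset are made up for illustration; only `pd.to_numeric` with `errors="coerce"` is standard pandas.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
df = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Score": ["100", "80", "-"],
    "WonGames": ["30", "20", "-"],
})
numeric_cols = ["Score", "WonGames"]

# errors="coerce" turns unparseable entries such as "-" into NaN instead of raising.
converted = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# A row where every numeric column failed to parse carries no performance data.
never_played = converted.isna().all(axis=1)

clean = df[~never_played].copy()
for col in numeric_cols:
    clean[col] = converted.loc[clean.index, col].astype(int)

print(clean)
```

Coercing first and dropping afterwards keeps the decision explicit: the mask shows exactly which rows are removed and why.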

The TeamLaunch column contains the year of launch, either as a Gregorian calendar year (YYYY) or as a range (YYYY-YY), probably a financial/academic year. Hence, let's extract the launch year alone in YYYY format.
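One possible way to do the extraction is a regular expression that keeps the first four-digit run. The sample values below are hypothetical and only mirror the two formats described above.

```python
import pandas as pd

# Hypothetical TeamLaunch values in the two formats described above.
launch = pd.Series(["1929", "1931-32", "1934-35", "2004"], name="TeamLaunch")

# Capture the first run of four digits and cast it to an integer year.
launch_year = launch.str.extract(r"(\d{4})", expand=False).astype(int)

print(launch_year.tolist())
```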

All columns converted without error

Rearrange the dataset, set the team name as the index, and check the info and description.

All invalid data points have been cleaned, and the dataset is ready for analysis.

Perform detailed statistical analysis and EDA using univariate, bi-variate and multivariate EDA techniques to get data driven insights on recommending which teams they can approach which will be a deal win for them. Also as a data and statistics expert you have to develop a detailed performance report using this data.

The attributes vary in scale, ranging from 10s to 1000s.

From the above statistics we find significant outliers in 3 columns: Score, WonGames and BasketScored. These outliers cannot be excluded, as those exceptional performances are what define the top teams.
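For reference, a common way to flag such outliers is the 1.5 × IQR rule used by box plots. This is a sketch under that assumption; the notebook does not state which rule was applied, and the sample scores are invented.

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Invented scores: seven ordinary teams and one exceptional performer.
scores = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="Score")
mask = iqr_outliers(scores)

print(scores[mask])
```

The mask only flags the points; whether to keep them is an analytical choice, and here they are kept.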

The attributes do not follow a normal distribution, probably because teams of various generations are being compared here (TeamLaunch ranges over 60 years).

Let's study the attribute-wise distributions to get a better picture.

With most of the attributes right-skewed and none following a normal distribution, it will be difficult to determine the better-performing teams.
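Skewness can also be quantified rather than read off the plots. A small sketch with invented data, using `DataFrame.skew`; positive values indicate a right-skewed column.

```python
import pandas as pd

# Invented columns: one with a long right tail, one perfectly symmetric.
df = pd.DataFrame({
    "Score": [10, 12, 11, 13, 12, 14, 11, 95],
    "Symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Sample skewness per column; values well above 0 mean right skew.
skewness = df.skew()

print(skewness)
```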

Bivariate Analysis

Quite a lot of attributes are found to be related, either positively or inversely.

Let us review the correlation coefficients to measure these relationships.
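A sketch of the correlation review on a toy frame. The column names follow the attributes mentioned in this report, `LostGames` is assumed, and the values are invented. `DataFrame.corr` computes Pearson coefficients by default.

```python
import pandas as pd

# Invented values; LostGames is an assumed column name.
df = pd.DataFrame({
    "WonGames": [30, 25, 20, 10],
    "LostGames": [10, 15, 20, 30],
    "BasketScored": [300, 260, 210, 120],
})

# Pearson correlation matrix: +1 is a perfect positive relation, -1 a perfect inverse one.
corr = df.corr()

print(corr.round(2))
```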

Feature Selection & Engineering

Now the refined attributes describe the group of teams more accurately.
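As an illustration of what such refined attributes could look like, the sketch below derives per-game ratios so that teams of different ages become comparable. The formulas and the column names `PlayedGames` and `BasketGiven` are assumptions; the values are invented.

```python
import pandas as pd

# Invented totals for two teams; column names are assumed.
teams = pd.DataFrame(
    {
        "PlayedGames": [100, 80],
        "WonGames": [60, 20],
        "BasketScored": [900, 600],
        "BasketGiven": [800, 700],
    },
    index=["Team A", "Team B"],
)

# Ratios remove the advantage older teams get from simply having played more.
teams["WinRatio"] = teams["WonGames"] / teams["PlayedGames"]
teams["BasketDiffPerGame"] = (
    teams["BasketScored"] - teams["BasketGiven"]
) / teams["PlayedGames"]

print(teams[["WinRatio", "BasketDiffPerGame"]])
```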

Having arrived at meaningful qualities of the group of teams, one may intuitively choose teams with a high WinRatio to invest in. So let's see whether that is a worthy investment.

Interestingly, yes: the teams with high WinRatios (Teams 1 to 5) have been TournamentChampions several times. But those are the oldest teams in the group and are expected to be under contract with competitors.

So who are we left with? Only young teams! Surprisingly, among teams not older than 25 years, there are 2 budding performers with high perseverance: Teams 21 & 25 have shown strong interest in playing frequently.
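The age filter can be sketched as below. The reference year, the use of `PlayedGames` as a proxy for how frequently a team plays, and all values are assumptions made for illustration.

```python
import pandas as pd

reference_year = 2021  # assumption: the year the analysis is performed

# Invented teams; TeamLaunch holds the cleaned YYYY launch year.
teams = pd.DataFrame(
    {"TeamLaunch": [1929, 1950, 1998, 2005], "PlayedGames": [2700, 1500, 400, 380]},
    index=["Team A", "Team B", "Team C", "Team D"],
)

teams["Age"] = reference_year - teams["TeamLaunch"]

# Keep teams not older than 25 years and rank them by games played.
young = teams[teams["Age"] <= 25].sort_values("PlayedGames", ascending=False)

print(young)
```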

Summary

I recommend that Company X invest in Teams 21 & 25 for assured grand success.

  1. Please include any improvements or suggestions to the association management on quality, quantity, variety, velocity, veracity etc. on the data points collected by the association to perform a better data analysis in future.

Part C

CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on business related to technology news, analysis of emerging trends, and profiling of new tech businesses and products. Its event, Startup Battlefield, is the world's pre-eminent startup competition. Startup Battlefield features 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.

OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.

All the attributes are found to be of object datatype. Going forward, the Funding column must be converted to an appropriate numeric type.

There are a total of 216 null (NaN) values in the dataset.
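A hedged sketch of one way to convert Funding to numbers. It assumes the values are strings such as "$630K" or "$1.5M"; this format is an assumption, and `parse_funding` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Assumed suffixes: K = thousand, M = million, B = billion.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_funding(value):
    """Convert strings like '$630K' to a float; missing or empty values become NaN."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().lstrip("$").replace(",", "")
    if not text:
        return float("nan")
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)

# Hypothetical raw values in the assumed format.
funding_raw = pd.Series(["$630K", "$1.5M", None, "$2B"], name="Funding")
funding = funding_raw.map(parse_funding)

print(funding)
```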
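A null count of this kind can be obtained with `isna().sum()`, first per column and then summed. The tiny frame below is invented, and `Result` is an assumed column name.

```python
import numpy as np
import pandas as pd

# Invented frame with a few missing entries.
df = pd.DataFrame({
    "Funding": ["$1M", None, "$500K"],
    "Result": [np.nan, "Winner", np.nan],
})

# Missing values per column, then the grand total across the frame.
nulls_per_column = df.isna().sum()
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(total_nulls)
```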

Data visualisation

The above graph indicates heavy skewness in the data, and flags as many as 60 Funding records as outliers.

But with a sample size of just 446, these outliers account for 13.45% of the sample.

Labelling more than 10% of the available sample as outliers and excluding them from further analysis would greatly distort the sample distribution. Hence, let us try transforming the Funding data to get a clearer picture.
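A sketch of such a transformation, assuming a base-10 logarithm (the base is not stated in the notebook) and strictly positive Funding values. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented funding amounts spanning several orders of magnitude.
funding = pd.Series([1e4, 5e4, 1e5, 5e5, 1e6, 1e8], name="Funding")

# log10 compresses the long right tail; it requires strictly positive values.
log_funding = np.log10(funding)

print(funding.skew(), log_funding.skew())
```

Comparing the two skewness values shows how much of the right tail the transformation removes without discarding any record.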

Statistical Analysis

Based on the above visualisations:

The above description suggests that the mean and spread of Funding vary significantly between operating and closed companies. However, the skewness could make this inference ambiguous.

Hence, let's review the same on the log-transformed data.

The previous inference is also supported by the log-transformed data: funds allocated to closed companies are far less than those allocated to successfully operating companies.

Let us also verify the same using a two-sample t-test.

Null hypothesis Ho

Alternate Hypothesis Ha

Conclusion: The above test confirms that the funds allocated to the two groups are not similar.
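A sketch of how such a test could be run with `scipy.stats.ttest_ind`. It assumes H0 is "mean log-funding is equal for operating and closed companies" against a two-sided alternative, a 0.05 significance level, and Welch's variant (`equal_var=False`). The two samples are synthetic, not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic log-funding samples standing in for the two groups.
operating = rng.normal(loc=6.0, scale=0.5, size=50)
closed = rng.normal(loc=5.0, scale=0.5, size=30)

# Welch's t-test does not assume equal variances between the groups.
result = stats.ttest_ind(operating, closed, equal_var=False)

alpha = 0.05  # assumed significance level
reject_h0 = result.pvalue < alpha

print(result.statistic, result.pvalue, reject_h0)
```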

Calculate percentage of winners that are still operating and percentage of contestants that are still operating

For the sake of analysis, all recognised companies are considered winners.
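The two percentages could be computed as sketched below. The column names `Result` and `OperatingState`, the label values, and the rows are all assumptions for illustration; any Result other than "Contestant" is treated as recognised.

```python
import pandas as pd

# Invented rows; column names and labels are assumed.
df = pd.DataFrame({
    "Result": ["Winner", "Contestant", "Finalist", "Contestant", "Runner up", "Contestant"],
    "OperatingState": ["Operating", "Closed", "Operating", "Operating", "Closed", "Closed"],
})

# Every company that is not a plain contestant counts as recognised.
group = df["Result"].ne("Contestant").map({True: "Recognised", False: "Contestant"})
operating = df["OperatingState"].eq("Operating")

# The mean of a boolean column is the share of True values.
pct_operating = operating.groupby(group).mean() * 100

print(pct_operating.round(2))
```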

Z test of proportions

Conclusion: Companies recognised in the Startup Battlefield event have survived better than the remaining contestants.

TC50 2008 & 2009 have seen the maximum number of contestants.
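For reference, a two-proportion z-test with a pooled standard error can be written in a few lines of standard-library Python. The function name and the counts in the example are hypothetical and are not taken from the dataset.

```python
from math import erfc, sqrt

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical counts: 90 of 100 recognised vs 60 of 100 contestants still operating.
z, p_value = two_proportion_ztest(90, 100, 60, 100)

print(z, p_value)
```

Since the conclusion is directional, a one-sided p-value (half the two-sided value when z is positive) would also be a reasonable choice.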